Abstract

Recent research on multiple kernel learning has led to a number of
approaches for combining kernels in regularized risk minimization. The
proposed approaches include different formulations of objectives and varying
regularization strategies. In this paper we present a unifying general
optimization criterion for multiple kernel learning and show how existing
formulations are subsumed as special cases. We also derive the criterion's
dual representation, which is suitable for general smooth optimization
algorithms. Finally, we evaluate multiple kernel learning in this framework
analytically using a Rademacher complexity bound on the generalization
error and empirically in a set of experiments.

1 Introduction

Selecting a suitable kernel for a kernel-based [17] machine learning task can be
a difficult task. From a statistical point of view, the problem of choosing a good
kernel is a model selection task. To this end, recent research has come up with
a number of multiple kernel learning (MKL) [10] approaches, which allow for an
automated selection of kernels from a predefined family of potential candidates.
Typically, MKL approaches come in one of these three different flavors:

(I) Instead of formulating an optimization criterion with a fixed kernel $k$, one
leaves the choice of $k$ as a variable and demands that $k$ is taken from a
linear span of base kernels $k := \sum_{i=1}^{M} \theta_i k_i$. The actual learning procedure
then optimizes not only over the parameters of the kernel classifier, but
also over the $\theta$ subject to the constraint that $\|\theta\| \le 1$ for some fixed
norm. This approach is taken for instance in ^ for regression and in [7] 
for classification. 

(II) A second approach optimizes over all kernel classifiers for each of the M 
base kernels, but modifies the regularizer to a block norm, that is, a norm 
of the vector containing the individual kernel norms. This allows one to trade
off the contributions of each kernel to the final classifier. This formulation
was used for instance in j21 [13] . 



(III) Finally, since it appears sensible to have only the best kernels
contribute to the final classifier, it makes sense to encourage sparse kernel
weights. One way to do so is to extend the second setting with an elastic
net regularizer, a linear combination of $\ell_1$ and $\ell_2$ regularizers. This
approach was recently described in [21].

While all of these formulations are based on similar considerations, the
individual formulations and the techniques used vary considerably. The particular
formulations are tailored more towards a specific optimization approach than
towards the inherent characteristics. Type (I) approaches, for instance, are
generally solved using a partially dualized wrapper approach, (II) makes use of
the fact that the $\ell_\infty$-norm computes a coordinatewise maximum, and (III)
solves MKL in the primal. This makes it hard to gain insights into the
underpinnings and differences of the individual methods, to design
general-purpose optimization procedures for the various criteria, and to compare
the different techniques empirically.

In this paper, we formulate MKL as an optimization criterion with a
dual-block-norm regularizer. By using this specific form of regularization, we can
incorporate all the previously mentioned formulations as special cases of a
single criterion. We derive a modular dual representation of the criterion, which
separates the contribution of the loss function and the regularizer. This allows
practitioners to plug in specific (dual) loss functions and to adjust the regularizer
in a flexible fashion. We show how the dual optimization problem can be solved
using standard smooth optimization techniques, report on experiments on real
world data, and compare the various approaches according to their ability to
recover sparse kernel weights. On the theoretical side, we give a concentration
inequality that bounds the generalization ability of MKL classifiers obtained in
the presented framework. The bound is the first known bound to apply to MKL
with elastic net regularization and it matches the best previously known bound
[B] for the special case of $\ell_1$ and $\ell_2$ regularization.

2 Generalized MKL 

In this section we cast multiple kernel learning in a unified framework. Before 
we go into the details, we need to introduce the general setting and notation. 

2.1 Multiple Kernel Learning 

We begin with reviewing the classical supervised learning setup. Given a labeled
sample $\mathcal{D} = \{(x_i, y_i)\}_{i=1,\ldots,n}$, where the $x_i$ lie in some input space
$\mathcal{X}$ and $y_i \in \mathcal{Y} \subset \mathbb{R}$, the goal is to find a hypothesis
$f \in \mathcal{H}$ that generalizes well on new and unseen data. Regularized risk
minimization returns a minimizer $f^*$,

\[ f^* \in \operatorname{argmin}_f \; R_{\mathrm{emp}}(f) + \lambda \Omega(f), \]

where $R_{\mathrm{emp}}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i)$ is the empirical risk of
hypothesis $f$ w.r.t. a convex loss function $\ell : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$,
$\Omega : \mathcal{H} \to \mathbb{R}$ is a regularizer, and $\lambda > 0$ is a trade-off parameter.
We consider linear models of the form

\[ f_w(x) = \langle w, \Phi(x) \rangle, \tag{1} \]

together with a (possibly non-linear) mapping $\Phi : \mathcal{X} \to \mathcal{H}$ to a Hilbert space
$\mathcal{H}$ [13] and constrain the regularization to be of the form
$\Omega(f) = \frac{1}{2} \|w\|_2^2$, which allows us to kernelize the resulting models and
algorithms. We will later make use of kernel functions
$k(x, x') = \langle \Phi(x), \Phi(x') \rangle_{\mathcal{H}}$ to compute inner products in $\mathcal{H}$.
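When learning with multiple kernels, we are given $M$ different feature mappings
$\Phi_m : \mathcal{X} \to \mathcal{H}_m$, $m = 1, \ldots, M$, each giving rise to a reproducing
kernel $k_m$ of $\mathcal{H}_m$. There are two main ways to formulate regularized risk
minimization with MKL. The first approach introduces a linear kernel mixture
$k_\theta = \sum_{m=1}^{M} \theta_m k_m$, $\theta_m \ge 0$. With this, one solves

\[ \inf_{w, \theta} \; C \sum_{i=1}^{n} \ell\Big( \sum_{m=1}^{M} \big\langle \sqrt{\theta_m}\, w_m, \Phi_m(x_i) \big\rangle_{\mathcal{H}_m}, \; y_i \Big) + \|w_\theta\|_{\mathcal{H}}^2 \tag{2} \]
\[ \text{s.t.} \quad \|\theta\|_q \le 1, \]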

with a blockwise weighted target vector
$w_\theta := (\sqrt{\theta_1}\, w_1^\top, \ldots, \sqrt{\theta_M}\, w_M^\top)^\top$.
Alternatively, one can omit the explicit mixture vector $\theta$ and use block-norm
regularization instead. In this case, one optimizes

\[ \inf_{w} \; C \sum_{i=1}^{n} \ell\Big( \sum_{m=1}^{M} \langle w_m, \Phi_m(x_i) \rangle_{\mathcal{H}_m}, \; y_i \Big) + \frac{1}{2} \|w\|_{2,p}^2, \tag{3} \]

where $\|w\|_{2,p} = \big( \sum_{m=1}^{M} \|w_m\|_{\mathcal{H}_m}^{p} \big)^{1/p}$ denotes the
$\ell_2/\ell_p$ block norm. One can show that (2) is a special case of (3). In
particular, one can show that setting the block-norm parameter to
$p = \frac{2q}{q+1}$ is equivalent to having kernel mixture regularization with
$\|\theta\|_q \le 1$ [7]. This also implies that the kernel mixture formulation is
strictly less general, because it cannot replace block norm regularization for
$p > 2$. Extending the block norm criterion to also include elastic net [23]
regularization, we thus choose the following minimization problem as the primary
object of investigation in this paper:

Primal MKL Optimization Problem 



\[ \inf_{w} \; C \sum_{i=1}^{n} \ell\big( \langle w, \Phi(x_i) \rangle_{\mathcal{H}}, \; y_i \big) + \frac{1}{2} \|w\|_{2,p}^2 + \frac{\mu}{2} \|w\|_2^2, \tag{P} \]

where $\Phi = \Phi_1 \times \cdots \times \Phi_M$ denotes the cartesian product of the
$\Phi_m$'s. Using the above criterion it is possible to recover block norm
regularization by setting $\mu = 0$ and the elastic net regularizer by setting $p = 1$.
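To make the criterion (P) concrete, the following is a minimal sketch that evaluates it for explicit, finite-dimensional feature maps. The function names, the use of the hinge loss, and the representation of each $\Phi_m$ as a matrix of feature rows are assumptions made for illustration; they are not part of the original text.

```python
import numpy as np


def block_norm(w_blocks, p):
    """l2/lp block norm (sum_m ||w_m||_2^p)^(1/p); p = np.inf gives the maximum."""
    norms = np.array([np.linalg.norm(w_m) for w_m in w_blocks])
    if np.isinf(p):
        return norms.max()
    return (norms ** p).sum() ** (1.0 / p)


def hinge(t, y):
    """Hinge loss max(0, 1 - y * t), applied elementwise (assumed loss)."""
    return np.maximum(0.0, 1.0 - y * t)


def primal_objective(w_blocks, phi_blocks, y, C, p, mu):
    """Value of (P): C * sum_i loss + 0.5 * ||w||_{2,p}^2 + 0.5 * mu * ||w||_2^2.

    phi_blocks[m] is an (n, d_m) array whose i-th row is Phi_m(x_i)
    (a hypothetical explicit feature representation).
    """
    # Decision values <w, Phi(x_i)> = sum_m <w_m, Phi_m(x_i)>
    t = sum(phi @ w for phi, w in zip(phi_blocks, w_blocks))
    w_all = np.concatenate(w_blocks)
    return (C * hinge(t, y).sum()
            + 0.5 * block_norm(w_blocks, p) ** 2
            + 0.5 * mu * np.dot(w_all, w_all))
```

Setting `mu=0` leaves only the block-norm regularizer, while `p=1` with `mu>0` gives the elastic net variant described above.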

2.2 Convex MKL in Dual Space 

Optimization problems often have a considerably easier structure when studied 
in the dual space. In this section we derive the dual problem of the generalized 



MKL approach presented in the previous section. Let us begin with rewriting 
Optimization Problem (P) by expanding the decision values into slack variables 
as follows 



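\[ \inf_{w, t} \; C \sum_{i=1}^{n} \ell(t_i, y_i) + \frac{1}{2} \|w\|_{2,p}^2 + \frac{\mu}{2} \|w\|_2^2 \tag{4} \]
\[ \text{s.t.} \quad \forall i : \; \langle w, \Phi(x_i) \rangle_{\mathcal{H}} = t_i. \]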



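Applying Lagrange's theorem re-incorporates the constraints into the objective
by introducing Lagrangian multipliers $\alpha \in \mathbb{R}^n$.¹ The Lagrangian saddle point
problem is then given by

\[ \sup_{\alpha} \inf_{w, t} \; C \sum_{i=1}^{n} \ell(t_i, y_i) + \frac{1}{2} \|w\|_{2,p}^2 + \frac{\mu}{2} \|w\|_2^2 - \sum_{i=1}^{n} \alpha_i \big( \langle w, \Phi(x_i) \rangle_{\mathcal{H}} - t_i \big). \tag{5} \]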



Setting the first partial derivatives of the above Lagrangian to zero w.r.t. $w$
gives the following KKT optimality condition

\[ \forall m : \quad w_m = \Big( \|w\|_{2,p}^{2-p} \, \|w_m\|_{\mathcal{H}_m}^{p-2} + \mu \Big)^{-1} \sum_{i=1}^{n} \alpha_i \Phi_m(x_i). \tag{6} \]



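Inspecting the above equation reveals the representation
$w_m^* \in \operatorname{span}(\Phi_m(x_1), \ldots, \Phi_m(x_n))$. Rearranging the order of terms in
the Lagrangian,

\[ \sup_{\alpha} \; -C \sum_{i=1}^{n} \sup_{t_i} \Big( -\frac{\alpha_i t_i}{C} - \ell(t_i, y_i) \Big) - \sup_{w} \Big( \Big\langle w, \sum_{i=1}^{n} \alpha_i \Phi(x_i) \Big\rangle_{\mathcal{H}} - \frac{1}{2} \|w\|_{2,p}^2 - \frac{\mu}{2} \|w\|_2^2 \Big), \]

lets us express the Lagrangian in terms of Fenchel-Legendre conjugate functions
$h^*(x) = \sup_u \, x^\top u - h(u)$ as follows,

\[ \sup_{\alpha} \; -C \sum_{i=1}^{n} \ell^*\Big( -\frac{\alpha_i}{C}, \, y_i \Big) - \Big( \frac{1}{2} \|\cdot\|_{2,p}^2 + \frac{\mu}{2} \|\cdot\|_2^2 \Big)^{*} \Big( \sum_{i=1}^{n} \alpha_i \Phi(x_i) \Big), \tag{7} \]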



thereby removing the dependency of the Lagrangian on $w$. The function $\ell^*$ is
called dual loss in the following. Recall that the Inf-Convolution [16] of two
functions $f$ and $g$ is defined by $(f \oplus g)(x) := \inf_y f(x - y) + g(y)$ and that
$(f^* \oplus g^*)(x) = (f + g)^*(x)$, and $(\eta f)^*(x) = \eta f^*(x/\eta)$. Moreover, we have for
the conjugate of the block norm
$\big( \frac{1}{2} \|\cdot\|_{2,p}^2 \big)^* = \frac{1}{2} \|\cdot\|_{2,p^*}^2$ 0] where $p^*$ is the
conjugate exponent, i.e., $\frac{1}{p} + \frac{1}{p^*} = 1$. As a consequence, we obtain the following
dual optimization problem

¹ Note that $\alpha$ is variable over the whole range of $\mathbb{R}^n$ since it incorporates an
equality constraint.



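Dual MKL Optimization Problem

\[ \sup_{\alpha} \; -C \sum_{i=1}^{n} \ell^*\Big( -\frac{\alpha_i}{C}, \, y_i \Big) - \Big( \frac{1}{2} \|\cdot\|_{2,p^*}^2 \oplus \frac{1}{2\mu} \|\cdot\|_2^2 \Big) \Big( \sum_{i=1}^{n} \alpha_i \Phi(x_i) \Big). \tag{D} \]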



Note that the supremum is also a maximum if the loss function is continuous.
The function $f \oplus \frac{1}{2\mu} \|\cdot\|^2$ is the so-called Moreau-Yosida Approximate [H] and has
been studied extensively both theoretically and algorithmically for its favorable
regularization properties. It can "smoothen" an optimization problem, even if
it is initially non-differentiable, and it improves the conditioning of the
Hessian for twice differentiable problems.

The above dual generalizes multiple kernel learning to arbitrary convex loss
functions and regularizers. Due to the mathematically clean separation of the
loss and the regularization term (each loss term solely depends on a single
real valued variable) we can immediately recover the corresponding dual for a
specific choice of a loss/regularizer pair $(\ell, \|\cdot\|_{2,p})$ by computing the pair of
conjugates $(\ell^*, \|\cdot\|_{2,p^*})$.
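The conjugate pairs used in the dual can be checked numerically. The sketch below approximates a one-dimensional Fenchel-Legendre conjugate by maximizing over a grid and compares it with two closed forms: the scaled quadratic $\frac{\mu}{2} u^2$, whose conjugate is $\frac{1}{2\mu} x^2$, and the hinge loss with label $y = 1$. The grid range, the step size, and the helper name `conjugate_on_grid` are illustrative assumptions.

```python
import numpy as np


def conjugate_on_grid(h, x, u_grid):
    """Approximate h*(x) = sup_u x*u - h(u) by taking the maximum over a grid of u."""
    return np.max(x * u_grid - h(u_grid))


mu = 2.0
u_grid = np.linspace(-10.0, 10.0, 200001)  # grid step 1e-4 (assumed resolution)

# Quadratic regularizer (mu/2) u^2, expected conjugate x^2 / (2 mu)
quadratic = lambda u: 0.5 * mu * u ** 2
quad_conj = conjugate_on_grid(quadratic, 3.0, u_grid)

# Hinge loss for label y = 1, expected conjugate l*(s) = s on -1 <= s <= 0
hinge = lambda u: np.maximum(0.0, 1.0 - u)
hinge_conj = conjugate_on_grid(hinge, -0.5, u_grid)
```

The grid maximum only approximates the supremum, so agreement with the closed forms is up to the grid resolution.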

2.3 Obtaining Kernel Weights 

While formalizing multiple kernel learning with block-norm regularization offers 
a number of conceptual and analytical advantages, it requires an additional step 
in practical applications. The reason for this is that the block-norm regularized 
dual optimization criterion does not include explicit kernel weights. Instead, 
this information is contained only implicitly in the optimal kernel classifier pa- 
rameters, as output by the optimizer. This is a problem, for instance if one 
wishes to apply the induced classifier on new test instances. Here we need 
the kernel weights to form the final kernel used for the actual prediction. To 
recover the underlying kernel weights, one essentially needs to identify which 
kernel contributed to which degree for the selection of the optimal dual solution. 
Depending on the actual parameterization of the primal criterion, this can be 
done in various ways. 

We start by reconsidering the KKT optimality condition given by Eq. (6)
and observe that the first term on the right hand side,

\[ \theta_m := \Big( \|w\|_{2,p}^{2-p} \, \|w_m\|_{\mathcal{H}_m}^{p-2} + \mu \Big)^{-1}, \tag{8} \]

introduces a scaling of the feature maps. With this notation, it is easy to see
from Eq. (6) that our model given by Eq. (1) extends to

\[ f_w(x) = \sum_{m=1}^{M} \sum_{i=1}^{n} \alpha_i \theta_m k_m(x_i, x). \]

In order to express the above model solely in terms of dual variables we have to
compute $\theta$ in terms of $\alpha$.

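In the following we focus on two cases. First, we consider $\ell_p$ block norm
regularization for arbitrary $1 < p < \infty$ while switching the elastic net off by
setting the parameter $\mu = 0$. Then, from Eq. (6) we obtain

\[ \|w_m\|_{\mathcal{H}_m} = \|w\|_{2,p}^{\frac{p-2}{p-1}} \, \Big\| \sum_{i=1}^{n} \alpha_i \Phi_m(x_i) \Big\|_{\mathcal{H}_m}^{\frac{1}{p-1}}, \quad \text{where } w_m = \theta_m \sum_{i=1}^{n} \alpha_i \Phi_m(x_i). \]

Resubstitution into (8) leads to the proportionality

\[ \exists c > 0 \;\; \forall m : \quad \theta_m = c \, \Big\| \sum_{i=1}^{n} \alpha_i \Phi_m(x_i) \Big\|_{\mathcal{H}_m}^{\frac{2-p}{p-1}}. \tag{9} \]

Note that, in the case of classification, we only need to compute $\theta$ up to a
positive multiplicative constant.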
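The proportionality for the block-norm case can be evaluated directly from the dual variables, since the norm of $\sum_i \alpha_i \Phi_m(x_i)$ equals $\sqrt{\alpha^\top K_m \alpha}$. The following sketch does this; the function name and the choice to normalize the weights to sum to one are assumptions (the text only fixes the weights up to a positive constant).

```python
import numpy as np


def block_norm_kernel_weights(alpha, kernel_matrices, p):
    """Kernel weights theta_m proportional to ||K_m||^((2-p)/(p-1)),

    with ||K_m|| = sqrt(alpha^T K_m alpha), for 1 < p < infinity and mu = 0.
    """
    if not 1.0 < p < np.inf:
        raise ValueError("the proportionality is stated for 1 < p < infinity")
    k_norms = np.array([np.sqrt(alpha @ K @ alpha) for K in kernel_matrices])
    theta = k_norms ** ((2.0 - p) / (p - 1.0))
    # The constant c is free; normalizing to sum one is an illustrative choice.
    return theta / theta.sum()
```

For `p = 2` the exponent vanishes and all kernels receive equal weight, which corresponds to the unweighted-sum kernel. For `p > 2` the exponent is negative, so a kernel with zero norm would need separate handling.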

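For the second case, let us now consider the elastic net regularizer, i.e.,
$p = 1 + \epsilon$ with $\epsilon \approx 0$ and $\mu > 0$. Then, the optimality condition given by Eq. (6)
translates to

\[ w_m = \theta_m \sum_{i=1}^{n} \alpha_i \Phi_m(x_i), \quad \text{where } \theta_m = \Bigg( \Big( \sum_{m'=1}^{M} \|w_{m'}\|_{\mathcal{H}_{m'}}^{1+\epsilon} \Big)^{\frac{1-\epsilon}{1+\epsilon}} \|w_m\|_{\mathcal{H}_m}^{\epsilon-1} + \mu \Bigg)^{-1}. \]

Inserting the left hand side expression for $\|w_m\|_{\mathcal{H}_m}$ into the right hand side
leads to the non-linear system of equalities

\[ \forall m : \quad \mu \, \theta_m \|K_m\|^{1-\epsilon} + \theta_m^{\epsilon} \Big( \sum_{m'=1}^{M} \theta_{m'}^{1+\epsilon} \|K_{m'}\|^{1+\epsilon} \Big)^{\frac{1-\epsilon}{1+\epsilon}} = \|K_m\|^{1-\epsilon}, \tag{10} \]

where we employ the notation
$\|K_m\| := \big\| \sum_{i=1}^{n} \alpha_i \Phi_m(x_i) \big\|_{\mathcal{H}_m}$. In our experiments
we solve the above conditions numerically using $\epsilon \approx 0$. The optimal
mixing coefficients $\theta_m$ can now be computed solely from the dual $\alpha$ variables
by means of Eq. (9) and (10), and by the kernel matrices $K_m$ using the identity

\[ \forall m = 1, \ldots, M : \quad \|K_m\| = \sqrt{\alpha^\top K_m \alpha}. \]

This enables optimization in the dual space as discussed in the next section.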
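A numerical solver for the elastic net conditions needs a residual to drive to zero. The sketch below writes one down under the assumption that the system is obtained by substituting $\|w_m\| = \theta_m \|K_m\|$ into the expression for $\theta_m$ with $p = 1 + \epsilon$; the function name and argument layout are hypothetical. For a single kernel the block norm coincides with the Hilbert norm, so the residual should vanish at $\theta = 1/(1+\mu)$, which gives a simple sanity check.

```python
import numpy as np


def elastic_net_residual(theta, k_norms, mu, eps):
    """Residual (left minus right hand side) of the elastic net conditions.

    theta   : candidate kernel weights, shape (M,)
    k_norms : ||K_m|| = sqrt(alpha^T K_m alpha), shape (M,)
    Assumes the system
      mu*theta_m*||K_m||^(1-eps)
        + theta_m^eps * (sum_m' (theta_m' ||K_m'||)^(1+eps))^((1-eps)/(1+eps))
        = ||K_m||^(1-eps).
    """
    theta = np.asarray(theta, dtype=float)
    k_norms = np.asarray(k_norms, dtype=float)
    mix = np.sum((theta * k_norms) ** (1.0 + eps)) ** ((1.0 - eps) / (1.0 + eps))
    lhs = mu * theta * k_norms ** (1.0 - eps) + theta ** eps * mix
    return lhs - k_norms ** (1.0 - eps)
```

Such a residual could be passed to a generic root finder; which solver the original experiments used is not stated in the text.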



3 Optimization Strategies 

In this section we describe how one can solve the dual optimization problem 
using an efficient quasi-Newton method. For our experiments, we use the hinge 
loss $\ell(x) = \max(0, 1 - x)$, but the discussion also applies to most other convex
loss functions. We first note that the dual loss of the hinge loss is
$\ell^*(t, y) = \frac{t}{y}$ if $-1 \le \frac{t}{y} \le 0$ and $\infty$ elsewise [15]. Hence, for each
$i$ the term $\ell^*(-\frac{\alpha_i}{C}, y_i)$ of the generalized dual, i.e., Optimization Problem (D),
translates to $-\frac{\alpha_i}{C y_i}$, provided that $0 \le \frac{\alpha_i}{y_i} \le C$. Employing a
variable substitution of the form $\alpha_i^{\mathrm{new}} = \frac{\alpha_i}{y_i}$, the
dual problem (D) becomes



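\[ \sup_{\alpha : \, 0 \le \alpha \le C \mathbf{1}} \; \mathbf{1}^\top \alpha - \Big( \frac{1}{2} \|\cdot\|_{2,p^*}^2 \oplus \frac{1}{2\mu} \|\cdot\|_2^2 \Big) \Big( \sum_{i=1}^{n} \alpha_i y_i \Phi(x_i) \Big), \]

and by definition of the Inf-convolution,

\[ \sup_{\alpha, \beta : \, 0 \le \alpha \le C \mathbf{1}} \; \mathbf{1}^\top \alpha - \frac{1}{2} \Big\| \sum_{i=1}^{n} \alpha_i y_i \Phi(x_i) - \beta \Big\|_{2,p^*}^2 - \frac{1}{2\mu} \|\beta\|_2^2. \tag{11} \]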



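We note that the representer theorem [17] is valid for the above problem, and
hence the solution of (11) can be expressed in terms of kernel functions, i.e.,
$\beta_m = \sum_{i=1}^{n} \gamma_i k_m(x_i, \cdot)$ for certain real coefficients $\gamma \in \mathbb{R}^n$ uniformly for
all $m$, hence $\beta = \sum_{i=1}^{n} \gamma_i \Phi(x_i)$. Thus, Eq. (11) has a representation of the form

\[ \sup_{\alpha, \gamma : \, 0 \le \alpha \le C \mathbf{1}} \; \mathbf{1}^\top \alpha - \frac{1}{2} \Big\| \sum_{i=1}^{n} (\alpha_i y_i - \gamma_i) \Phi(x_i) \Big\|_{2,p^*}^2 - \frac{1}{2\mu} \Big\| \sum_{i=1}^{n} \gamma_i \Phi(x_i) \Big\|_2^2. \]

The above expression can be written in terms of kernel matrices as follows.

Hinge Loss Dual Optimization Problem

\[ \sup_{\alpha, \gamma : \, 0 \le \alpha \le C \mathbf{1}} \; \mathbf{1}^\top \alpha - \frac{1}{2} \Bigg( \sum_{m=1}^{M} \Big( (\alpha \circ y - \gamma)^\top K_m (\alpha \circ y - \gamma) \Big)^{\frac{p^*}{2}} \Bigg)^{\frac{2}{p^*}} - \frac{1}{2\mu} \gamma^\top K \gamma, \tag{D'} \]

where we denote by $x \circ y$ the elementwise multiplication of two vectors and use
the shorthand $K = \sum_{m=1}^{M} K_m$.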
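The kernel-matrix form of the hinge loss dual is straightforward to evaluate. The sketch below computes its objective for given $\alpha$ and $\gamma$; the function name is hypothetical, and the code assumes $p > 1$ and $\mu > 0$ so that the conjugate exponent and the factor $\frac{1}{2\mu}$ are finite.

```python
import numpy as np


def hinge_dual_objective(alpha, gamma, y, kernel_matrices, p, mu):
    """Objective of the hinge loss dual (D') for p > 1 and mu > 0."""
    p_star = p / (p - 1.0)                 # conjugate exponent, 1/p + 1/p* = 1
    v = alpha * y - gamma                  # (alpha o y) - gamma
    # Squared Hilbert norm per kernel, clipped at 0 against round-off
    quad = np.array([max(v @ K @ v, 0.0) for K in kernel_matrices])
    block = np.sum(quad ** (p_star / 2.0)) ** (2.0 / p_star)
    K_sum = sum(kernel_matrices)           # shorthand K = sum_m K_m
    return alpha.sum() - 0.5 * block - gamma @ K_sum @ gamma / (2.0 * mu)
```

The box constraint on $\alpha$ is not enforced here; it would be handled by the optimizer that maximizes this function.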



4 Theoretical Results 

In this section we give two uniform convergence bounds for the generalization
error of the multiple kernel learning formulation presented in Section 2. The
results are based on the established theory on Rademacher complexities. Let
$\sigma_1, \ldots, \sigma_n$ be a set of independent Rademacher variables, which obtain the
values $-1$ or $+1$ with the same probability 0.5, and let $\mathcal{C}$ be some space of
classifiers $c : \mathcal{X} \to \mathbb{R}$. Then, the Rademacher complexity of $\mathcal{C}$ is given by

\[ \mathcal{R}_{\mathcal{C}} := \mathbb{E}\Big[ \sup_{c \in \mathcal{C}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i c(x_i) \Big]. \]

² We employ the notation $s = (s_1, \ldots, s_M)^\top = (s_m)_{m=1}^{M}$ for $s \in \mathbb{R}^M$.
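For a small sample, the expectation over the Rademacher variables can be computed exactly by enumerating all sign vectors. The sketch below does so for one simple class, the linear functions with Euclidean norm at most one, where the supremum has the closed form $\|\frac{1}{n} \sum_i \sigma_i x_i\|_2$. The class, the function name, and the conditioning on a fixed sample (the empirical variant of the quantity defined above) are assumptions made for illustration.

```python
import itertools

import numpy as np


def empirical_rademacher_unit_ball(X):
    """Exact empirical Rademacher complexity of {x -> <w, x> : ||w||_2 <= 1}.

    For a fixed sign vector sigma, sup_w (1/n) sum_i sigma_i <w, x_i>
    equals ||(1/n) sum_i sigma_i x_i||_2; we average over all 2^n sign vectors.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    total = 0.0
    for signs in itertools.product([-1.0, 1.0], repeat=n):
        total += np.linalg.norm(np.asarray(signs) @ X) / n
    return total / 2 ** n
```

Enumeration costs $2^n$ evaluations, so this is only practical for very small $n$; for larger samples one would sample sign vectors at random instead.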



If the Rademacher complexity of a class of classifiers is known, it can be used 
to bound the generalization error. We give one result here, and refer to the 
literature [1] for further results on Rademacher penalization. 

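Theorem 1 Assume the loss $\ell : \mathbb{R} \to \mathbb{R}$ has $\ell(0) = 0$, is Lipschitz with constant
$L$ and $\ell(x) \le 1$ for all $x$. Then, the following holds with probability larger than
$1 - \delta$ for all classifiers $c \in \mathcal{C}$:

\[ \mathbb{E}[\ell(y \, c(x))] \le \frac{1}{n} \sum_{i=1}^{n} \ell(y_i \, c(x_i)) + 2 L \mathcal{R}_{\mathcal{C}} + \sqrt{\frac{8 \ln \frac{2}{\delta}}{n}}. \tag{12} \]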



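We will now give an upper bound for the Rademacher complexity of the
block-norm regularized linear learning approach described above. More precisely, for
$1 \le i \le M$ let $\|w\|_{\star_i} := \sqrt{k_i(w, w)}$ denote the norm induced by kernel $k_i$ and
for $x \in \mathbb{R}^M$, $p, q \ge 1$ and $C_1, C_2 \ge 0$ with $C_1 + C_2 = 1$ define

\[ \|x\|_O := C_1 \|x\|_p + C_2 \|x\|_q. \]

We now give a bound for the following class of linear classifiers:

\[ \mathcal{C}_\star := \Bigg\{ c = \begin{pmatrix} w_1 \\ \vdots \\ w_M \end{pmatrix} \;\Bigg|\; \Bigg\| \begin{pmatrix} \|w_1\|_{\star_1} \\ \vdots \\ \|w_M\|_{\star_M} \end{pmatrix} \Bigg\|_O \le 1 \Bigg\}. \]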



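Theorem 2 Assume the kernels are normalized, i.e. $k_i(x, x) \le 1$ for
all $x \in \mathcal{X}$ and all $1 \le i \le M$. Then, the Rademacher complexity of the class
$\mathcal{C}_\star$ of linear classifiers with block norm regularization is upper-bounded as follows:

\[ \mathcal{R}_{\mathcal{C}_\star} \le \frac{M}{C_1 M^{\frac{1}{p}} + C_2 M^{\frac{1}{q}}} \Bigg( \sqrt{\frac{2 \ln M}{n}} + \sqrt{\frac{1}{n}} \Bigg). \tag{13} \]

For the special case with $p \ge 2$ and $q \ge 2$, the bound can be improved as follows:

\[ \mathcal{R}_{\mathcal{C}_\star} \le \frac{M}{C_1 M^{\frac{1}{p}} + C_2 M^{\frac{1}{q}}} \sqrt{\frac{1}{n}}. \tag{14} \]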
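To get a feeling for how the bounds of Theorem 2 scale with the number of kernels $M$ and the mixing constants $C_1, C_2$, they can be evaluated numerically. The sketch below assumes the general bound is the factor $M / (C_1 M^{1/p} + C_2 M^{1/q})$ times $\sqrt{2 \ln M / n} + \sqrt{1/n}$, as obtained by combining the steps of the proof, and that the improved bound replaces the second factor by $\sqrt{1/n}$; the function name and signature are hypothetical.

```python
import math


def mkl_rademacher_bound(M, n, C1, C2, p, q, improved=False):
    """Evaluate the (assumed) Rademacher complexity bounds of Theorem 2.

    improved=False: general bound, valid for p, q >= 1.
    improved=True : special-case bound, requires p >= 2 and q >= 2.
    """
    if abs(C1 + C2 - 1.0) > 1e-12 or C1 < 0 or C2 < 0:
        raise ValueError("C1 and C2 must be non-negative and sum to one")
    factor = M / (C1 * M ** (1.0 / p) + C2 * M ** (1.0 / q))
    if improved:
        if p < 2 or q < 2:
            raise ValueError("the improved bound needs p >= 2 and q >= 2")
        return factor * math.sqrt(1.0 / n)
    return factor * (math.sqrt(2.0 * math.log(M) / n) + math.sqrt(1.0 / n))
```

With `C1=1, p=1` the leading factor equals one, and with `C2=1, q=2` it equals `sqrt(M)`, which mirrors the discussion of the elastic net case that follows.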



It is instructive to compare this result to some of the existing MKL bounds in
the literature. For instance, the main result in [6] bounds the Rademacher
complexity of the $\ell_1$-norm regularizer with a $O(\sqrt{\ln M / n})$ term. We get the same
result by setting $C_1 = 1$, $C_2 = 0$ and $p = 1$. For the $\ell_2$-norm regularized setting,
we can set $C_1 = 1$, $C_2 = 0$ and $p = \frac{4}{3}$ (because the kernel weight formulation
with $\ell_2$ norm corresponds to the block-norm representation with $p = \frac{4}{3}$) to
recover their $O(M^{\frac{1}{4}} / \sqrt{n})$ bound. Finally, it is interesting to see how changing the
$C_1$ parameter influences the generalization capacity of the elastic net regularizer
($p = 1$, $q = 2$). For $C_1 = 1$, we essentially recover the $\ell_1$ regularization penalty,
but as $C_1$ approaches 0, the bound includes an additional $O(\sqrt{M})$ term. This
shows how the capacity of the elastic net regularizer increases towards the $\ell_2$
setting with decreasing sparsity.
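Proof [of Theorem 2] Using the notation $w := (w_1, \ldots, w_M)^\top$ and
$\|w\|_B := \big\| (\|w_1\|_{\star_1}, \ldots, \|w_M\|_{\star_M})^\top \big\|_O$ it is easy to see that

\[ \mathbb{E}\Big[ \sup_{c \in \mathcal{C}_\star} \frac{1}{n} \sum_{i=1}^{n} \sigma_i y_i c(x_i) \Big]
 = \mathbb{E}\Bigg[ \sup_{\|w\|_B \le 1} \begin{pmatrix} w_1 \\ \vdots \\ w_M \end{pmatrix}^{\top} \begin{pmatrix} \frac{1}{n} \sum_{i=1}^{n} \sigma_i \Phi_1(x_i) \\ \vdots \\ \frac{1}{n} \sum_{i=1}^{n} \sigma_i \Phi_M(x_i) \end{pmatrix} \Bigg]
 = \mathbb{E}\Bigg[ \Bigg\| \begin{pmatrix} \big\| \frac{1}{n} \sum_{i=1}^{n} \sigma_i \Phi_1(x_i) \big\|_{\star_1} \\ \vdots \\ \big\| \frac{1}{n} \sum_{i=1}^{n} \sigma_i \Phi_M(x_i) \big\|_{\star_M} \end{pmatrix} \Bigg\|_O^{*} \Bigg], \]

where $\|x\|^* := \sup_z \{ z^\top x \mid \|z\| \le 1 \}$ denotes the dual norm of $\|\cdot\|$ and we use the
fact that $\|w\|_B^* = \big\| (\|w_1\|_{\star_1}^*, \ldots, \|w_M\|_{\star_M}^*)^\top \big\|_O^*$ [2], and that
$\|\cdot\|_{\star_i}^* = \|\cdot\|_{\star_i}$. We will show that this quantity is upper bounded by

\[ \frac{M}{C_1 M^{\frac{1}{p}} + C_2 M^{\frac{1}{q}}} \Bigg( \sqrt{\frac{2 \ln M}{n}} + \sqrt{\frac{1}{n}} \Bigg). \tag{15} \]

As a first step we prove that for any $x \in \mathbb{R}^M$

\[ \|x\|_O^* \le \frac{M}{C_1 M^{\frac{1}{p}} + C_2 M^{\frac{1}{q}}} \|x\|_\infty. \tag{16} \]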

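For any $a \ge 1$ we can apply Hölder's inequality to the dot product of $x \in \mathbb{R}^M$
and $\mathbf{1}_M := (1, \ldots, 1)^\top$ and obtain
$\|x\|_1 \le \|\mathbf{1}_M\|_{\frac{a}{a-1}} \cdot \|x\|_a = M^{\frac{a-1}{a}} \|x\|_a$. Since
$C_1 + C_2 = 1$, we can apply this twice on the two components of $\|\cdot\|_O$ to get a
lower bound for $\|x\|_O$,

\[ \big( C_1 M^{\frac{1-p}{p}} + C_2 M^{\frac{1-q}{q}} \big) \|x\|_1 \le C_1 \|x\|_p + C_2 \|x\|_q = \|x\|_O. \]

In other words, for every $x \in \mathbb{R}^M$ with $\|x\|_O \le 1$ it holds that

\[ \|x\|_1 \le 1 \big/ \big( C_1 M^{\frac{1-p}{p}} + C_2 M^{\frac{1-q}{q}} \big) = M \big/ \big( C_1 M^{\frac{1}{p}} + C_2 M^{\frac{1}{q}} \big). \]

Thus,

\[ \big\{ z^\top x \mid \|z\|_O \le 1 \big\} \subseteq \Big\{ z^\top x \;\Big|\; \|z\|_1 \le \frac{M}{C_1 M^{\frac{1}{p}} + C_2 M^{\frac{1}{q}}} \Big\}. \tag{17} \]

This means we can bound the dual norm $\|\cdot\|_O^*$ of $\|\cdot\|_O$ as follows:

\[ \|x\|_O^* = \sup_z \big\{ z^\top x \mid \|z\|_O \le 1 \big\}
 \le \sup_z \Big\{ z^\top x \;\Big|\; \|z\|_1 \le \frac{M}{C_1 M^{\frac{1}{p}} + C_2 M^{\frac{1}{q}}} \Big\}
 = \frac{M}{C_1 M^{\frac{1}{p}} + C_2 M^{\frac{1}{q}}} \|x\|_\infty. \tag{18} \]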



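This accounts for the first factor in (15). For the second factor, we show that

\[ \mathbb{E}\Bigg[ \Bigg\| \begin{pmatrix} \big\| \frac{1}{n} \sum_{i=1}^{n} \sigma_i \Phi_1(x_i) \big\|_{\star_1} \\ \vdots \\ \big\| \frac{1}{n} \sum_{i=1}^{n} \sigma_i \Phi_M(x_i) \big\|_{\star_M} \end{pmatrix} \Bigg\|_\infty \Bigg] \le \sqrt{\frac{2 \ln M}{n}} + \sqrt{\frac{1}{n}}. \tag{19} \]

To do so, define

\[ V_k := \Big\| \frac{1}{n} \sum_{i=1}^{n} \sigma_i \Phi_k(x_i) \Big\|_{\star_k}^2 = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \sigma_i \sigma_j k_k(x_i, x_j). \]

By the independence of the Rademacher variables it follows for all $k \le M$,

\[ \mathbb{E}[V_k] = \frac{1}{n^2} \sum_{i=1}^{n} \mathbb{E}[k_k(x_i, x_i)] \le \frac{1}{n}. \tag{20} \]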

i=\ 

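In the next step we use a martingale argument to find an upper bound for
$\sup_k [W_k]$ where $W_k := \sqrt{V_k} - \mathbb{E}[\sqrt{V_k}]$. For ease of notation, we write
$\mathbb{E}_{(r)}[X]$ to denote the conditional expectation
$\mathbb{E}[X \mid (x_1, \sigma_1), \ldots, (x_r, \sigma_r)]$. We define the following martingale:

\[ Z_k^{(r)} := \mathbb{E}_{(r)}\big[ \sqrt{V_k} \big] - \mathbb{E}_{(r-1)}\big[ \sqrt{V_k} \big]
 = \mathbb{E}_{(r)}\Big[ \Big\| \frac{1}{n} \sum_{i=1}^{n} \sigma_i \Phi_k(x_i) \Big\|_{\star_k} \Big] - \mathbb{E}_{(r-1)}\Big[ \Big\| \frac{1}{n} \sum_{i=1}^{n} \sigma_i \Phi_k(x_i) \Big\|_{\star_k} \Big]. \tag{21} \]

The range of each random variable $Z_k^{(r)}$ is at most $\frac{2}{n}$. This is because switching
the sign of $\sigma_r$ changes only one summand in the sum from $-\Phi_k(x_r)$ to $+\Phi_k(x_r)$.
Thus, the random variable changes by at most
$\big\| \frac{2}{n} \Phi_k(x_r) \big\|_{\star_k} \le \frac{2}{n} \sqrt{k_k(x_r, x_r)} \le \frac{2}{n}$.
Hence, we can apply Hoeffding's inequality,
$\mathbb{E}_{(r-1)}\big[ e^{s Z_k^{(r)}} \big] \le e^{\frac{1}{2 n^2} s^2}$.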

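This allows us to bound the expectation of $\sup_k W_k$ as follows:

\[ \mathbb{E}\big[ \sup_k W_k \big] = \mathbb{E}\Big[ \frac{1}{s} \ln \sup_k e^{s W_k} \Big]
 \le \mathbb{E}\Big[ \frac{1}{s} \ln \sum_{k=1}^{M} \exp\Big( s \sum_{r=1}^{n} Z_k^{(r)} \Big) \Big] \]
\[ \le \frac{1}{s} \ln \sum_{k=1}^{M} \mathbb{E}\Big[ e^{s \sum_{r=1}^{n-1} Z_k^{(r)}} \, \mathbb{E}_{(n-1)}\big[ e^{s Z_k^{(n)}} \big] \Big]
 \le \frac{1}{s} \ln \sum_{k=1}^{M} e^{\frac{s^2}{2n}}
 = \frac{\ln M}{s} + \frac{s}{2n}, \]

where we applied Hoeffding's inequality $n$ times. Setting $s = \sqrt{2 n \ln M}$ yields:

\[ \mathbb{E}\big[ \sup_k W_k \big] \le \sqrt{\frac{2 \ln M}{n}}. \tag{22} \]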
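The choice $s = \sqrt{2 n \ln M}$ is the minimizer of $\frac{\ln M}{s} + \frac{s}{2n}$ over $s > 0$, and the minimum value is $\sqrt{2 \ln M / n}$. A short numerical check of this step, with hypothetical helper names:

```python
import math


def sup_bound(s, M, n):
    """ln(M)/s + s/(2n): the bound on E[sup_k W_k] before s is chosen."""
    return math.log(M) / s + s / (2.0 * n)


def optimal_s(M, n):
    """Minimizer of sup_bound over s > 0, obtained by setting the derivative to zero."""
    return math.sqrt(2.0 * n * math.log(M))
```

Evaluating `sup_bound(optimal_s(M, n), M, n)` for any `M > 1` and `n >= 1` should reproduce `sqrt(2 * ln(M) / n)` up to floating point error.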

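Now, we can combine (20) and (22):

\[ \mathbb{E}\big[ \sup_k \sqrt{V_k} \big] \le \mathbb{E}\Big[ \sup_k W_k + \sqrt{\mathbb{E}[V_k]} \Big] \le \sqrt{\frac{2 \ln M}{n}} + \sqrt{\frac{1}{n}}. \]

This concludes the proof of (19) and therewith (13).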

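The special case (14) for $p, q \ge 2$ is similar. As a first step, we modify (16)
to deal with the $\ell_2$-norm rather than the $\ell_\infty$-norm:

\[ \|x\|_O^* \le \frac{\sqrt{M}}{C_1 M^{\frac{1}{p}} + C_2 M^{\frac{1}{q}}} \|x\|_2. \tag{23} \]

To see this, observe that for any $x \in \mathbb{R}^M$ and any $a \ge 2$ Hölder's inequality
gives $\|x\|_2 \le M^{\frac{a-2}{2a}} \|x\|_a$. Applying this to the two components of $\|\cdot\|_O$ we have:

\[ \big( C_1 M^{\frac{2-p}{2p}} + C_2 M^{\frac{2-q}{2q}} \big) \|x\|_2 \le C_1 \|x\|_p + C_2 \|x\|_q = \|x\|_O. \]

In other words, for every $x \in \mathbb{R}^M$ with $\|x\|_O \le 1$ it holds that

\[ \|x\|_2 \le 1 \big/ \big( C_1 M^{\frac{2-p}{2p}} + C_2 M^{\frac{2-q}{2q}} \big) = \sqrt{M} \big/ \big( C_1 M^{\frac{1}{p}} + C_2 M^{\frac{1}{q}} \big). \]

Following the same arguments as in (17) and (18) we obtain (23). To finish the
proof it now suffices to show that

\[ \mathbb{E}\Bigg[ \Bigg\| \begin{pmatrix} \big\| \frac{1}{n} \sum_{i=1}^{n} \sigma_i \Phi_1(x_i) \big\|_{\star_1} \\ \vdots \\ \big\| \frac{1}{n} \sum_{i=1}^{n} \sigma_i \Phi_M(x_i) \big\|_{\star_M} \end{pmatrix} \Bigg\|_2 \Bigg] \le \sqrt{\frac{M}{n}}. \]

This can be seen by a straightforward application of (20):

\[ \mathbb{E}\Bigg[ \sqrt{ \sum_{k=1}^{M} \Big\| \frac{1}{n} \sum_{i=1}^{n} \sigma_i \Phi_k(x_i) \Big\|_{\star_k}^2 } \Bigg]
 \le \sqrt{ \mathbb{E}\Big[ \sum_{k=1}^{M} V_k \Big] }
 \le \sqrt{ \sum_{k=1}^{M} \frac{1}{n} }
 = \sqrt{\frac{M}{n}}. \]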

5 Empirical Results 

In this section we evaluate the proposed method on artificial and real data sets.
We chose the limited memory quasi-Newton software L-BFGS-B [22] to solve
(D'). L-BFGS-B approximates the Hessian matrix based on the last $t$ gradients,
where $t$ is a parameter to be chosen by the user.

Figure 1: Empirical results of the artificial experiment for varying true underlying
data sparsity.
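As an illustration of solving the hinge loss dual (D') with L-BFGS-B, the following sketch uses SciPy's `minimize` with `method="L-BFGS-B"`, whose `maxcor` option sets the number of stored corrections (the role played by $t$ above). This is not the authors' implementation: the function name, the default parameters, the zero starting point, and the use of finite-difference gradients are all assumptions, and the objective is the kernel-matrix form of (D') with box constraint $0 \le \alpha \le C$.

```python
import numpy as np
from scipy.optimize import minimize


def solve_hinge_dual(kernel_matrices, y, C=1.0, p=2.0, mu=1.0, memory=10):
    """Maximize the hinge loss dual over (alpha, gamma) with L-BFGS-B.

    Returns (alpha, gamma, dual objective value). Gradients are approximated
    by finite differences, which is adequate only for small toy problems.
    """
    n = len(y)
    p_star = p / (p - 1.0)                      # conjugate exponent, needs p > 1
    K_sum = sum(kernel_matrices)

    def neg_dual(z):
        alpha, gamma = z[:n], z[n:]
        v = alpha * y - gamma
        quad = np.array([max(v @ K @ v, 0.0) for K in kernel_matrices])
        block = np.sum(quad ** (p_star / 2.0)) ** (2.0 / p_star)
        value = alpha.sum() - 0.5 * block - gamma @ K_sum @ gamma / (2.0 * mu)
        return -value                           # minimize the negative dual

    bounds = [(0.0, C)] * n + [(None, None)] * n  # box on alpha, gamma is free
    result = minimize(neg_dual, np.zeros(2 * n), method="L-BFGS-B",
                      bounds=bounds, options={"maxcor": memory})
    return result.x[:n], result.x[n:], -result.fun
```

For realistic problem sizes one would supply an analytic gradient through the `jac` argument rather than rely on finite differences.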



5.1 Experiments with Sparse and Non-Sparse Kernel Sets 

The goal of this section is to study the relationship of the level of sparsity of the
true underlying function to the chosen block norm or elastic net MKL model.
Apart from investigating which parameter choice leads to optimal results, we
are also interested in the effects of suboptimal choices of $p$. To this aim we
constructed several artificial data sets in which we vary the degree of sparsity in
the true kernel mixture coefficients. We go from having all weight focussed on a
single kernel (the highest level of sparsity) to uniform weights (the least sparse
scenario possible) in several steps. We then study the statistical performance of
$\ell_p$-block-norm MKL for different values of $p$ that cover the entire range $[1, \infty]$.
We follow the experimental setup of [8] but compute classification models for
$p = 1, 4/3, 2, 4, \infty$ block-norm MKL and $\mu = 10$ elastic net MKL. The results are
shown in Fig. 1 and compared to the Bayes error that is computed analytically
from the underlying probability model.

Unsurprisingly, $\ell_1$ performs best in the sparse scenario, where only a single
kernel carries the whole discriminative information of the learning problem. In
contrast, the $\ell_\infty$-norm MKL performs best when all kernels are equally
informative. Both MKL variants reach the Bayes error in their respective
scenarios. The elastic net MKL performs comparably to $\ell_1$-block-norm MKL. The
non-sparse $\ell_{4/3}$-norm MKL and the unweighted-sum kernel SVM perform best
in the balanced scenarios, i.e., when the noise level is ranging in the interval
60%-92%. The non-sparse $\ell_4$-norm MKL of [2] performs well only in the most
non-sparse scenarios. Intuitively, the non-sparse $\ell_{4/3}$-norm MKL of O [7] is the
most robust MKL variant, achieving a test error of less than 0.1% in all
scenarios. The sparse $\ell_1$-norm MKL performs worst when the noise level is less
than 82%. It is worth mentioning that when considering the most challenging
model/scenario combination, that is $\ell_\infty$-norm in the sparse and $\ell_1$-norm in the
uniformly non-sparse scenario, the $\ell_1$-norm MKL performs much more robustly
than its $\ell_\infty$ counterpart. However, as witnessed in the following sections, this
does not prevent $\ell_\infty$-norm MKL from performing very well in practice. In
summary, we conclude that by tuning the sparsity parameter $p$ for each experiment,
block norm MKL achieves a low test error across all scenarios.

Table 1: Results for the bioinformatics experiment.

Model                      AUC ± stderr
μ = 0.01 elastic net       85.80 ± 0.21
μ = 0.1 elastic net        85.66 ± 0.15
μ = 1 elastic net          83.75 ± 0.14
μ = 10 elastic net         84.56 ± 0.13
μ = 100 elastic net        84.07 ± 0.18
1-block-norm MKL           84.83 ± 0.12
4/3-block-norm MKL         85.66 ± 0.12
2-block-norm MKL           85.25 ± 0.11
4-block-norm MKL           85.28 ± 0.10
∞-block-norm MKL           87.67 ± 0.09

5.2 Gene Start Recognition 

This experiment aims at detecting transcription start sites (TSS) of RNA Poly- 
merase II binding genes in genomic DNA sequences. Accurate detection of the 
transcription start site is crucial to identify genes and their promoter regions 
and can be regarded as a first step in deciphering the key regulatory elements 
in the promoter region that determine transcription. 

Many detectors thereby rely on a combination of feature sets, which makes
the learning task appealing for MKL. For our experiments we use the data
set from [20] and we employ five different kernels representing the TSS signal
(weighted degree with shift), the promoter (spectrum), the 1st exon (spectrum),
angles (linear), and energies (linear). The kernel matrices are normalized such
that each feature vector has unit norm in Hilbert space. We reserve 500 and 500
randomly drawn instances for holdout and test sets, respectively, and use 1,000
as the training pool from which 250 elemental training sets are drawn. Table 1
shows the area under the ROC curve (AUC) averaged over 250 repetitions of
the experiment. Thereby 1 and $\infty$ block norms are approximated by 64/63 and
64 norms, respectively. For the elastic net we use an $\ell_{1.05}$-block-norm penalty.
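Normalizing a kernel matrix so that every feature vector has unit norm in the Hilbert space amounts to dividing each entry $K_{ij}$ by $\sqrt{K_{ii} K_{jj}}$. A minimal sketch of this standard operation follows; the function name is hypothetical, and the code assumes a strictly positive diagonal.

```python
import numpy as np


def normalize_kernel(K):
    """Return K_ij / sqrt(K_ii * K_jj), so that every diagonal entry becomes 1.

    Assumes K is a square kernel matrix with strictly positive diagonal.
    """
    K = np.asarray(K, dtype=float)
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)
```

After normalization, $k(x, x) = 1$ for every training point, which matches the normalization assumption of Theorem 2.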

The results vary greatly between the MKL models. The elastic net model
gives the best prediction for $\mu = 0.01$ by essentially approximating the
$\ell_{1.05}$-block-norm MKL. Out of the block norm MKLs the classical $\ell_1$-norm MKL has
the worst prediction accuracy and is even outperformed by an unweighted-sum
kernel SVM (i.e., $p = 2$ norm MKL). In accordance with previous experiments
in [7] the $p = 4/3$-block-norm has the highest prediction accuracy of the models
within the parameter range $p \in [1, 2]$. Surprisingly, this superior performance
can even be improved considerably by the recent $\ell_\infty$-block-norm MKL of [14].
This is remarkable, and of significance for the application domain: the method
using the unweighted sum of kernels [20] has recently been confirmed to be the
leading one in a comparison of 19 state-of-the-art promoter prediction programs [1],
and our experiments suggest that its accuracy can be further improved by $\ell_\infty$
MKL.

Table 2: Results for the intrusion detection experiment.

Model                      AUC_{0.1} ± stderr
μ = 0.01 elastic net       99.36 ± 0.14
μ = 0.1 elastic net        99.46 ± 0.13
μ = 1 elastic net          99.38 ± 0.12
μ = 10 elastic net         99.43 ± 0.11
μ = 100 elastic net        99.34 ± 0.13
1-block-norm MKL           99.41 ± 0.14
4/3-block-norm MKL         99.20 ± 0.15
2-block-norm MKL           99.25 ± 0.15
4-block-norm MKL           99.14 ± 0.16
∞-block-norm MKL           99.68 ± 0.09

5.3 Network Intrusion Detection 

For the intrusion detection experiments we use the data set described in [9] 
consisting of HTTP traffic recorded at Fraunhofer Institute FIRST Berlin. The 
unsanitized data contains 500 normal HTTP requests drawn randomly from 
incoming traffic recorded over two months. Malicious traffic is generated using 
the Metasploit framework [H] and consists of 30 instances of 10 real attack 
classes from recent exploits, including buffer overflows and PHP vulnerabilities. 
Every attack is recorded in different variants using virtual network environments 
and decoy HTTP servers. 

We deploy 10 spectrum kernels [TTJ [18] for 1, 2, ..., 10-gram feature
representations. All data points are normalized to unit norm in feature space to avoid
dependencies on the HTTP request length. We randomly split the normal data
into 100 training, 200 validation and 250 test examples. We report on average
areas under the ROC curve in the false-positive interval [0, 0.1] ($\mathrm{AUC}_{[0,0.1]}$) over
100 repetitions with distinct training, holdout, and test sets.
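A spectrum kernel over $n$-grams compares two strings by the inner product of their $n$-gram count vectors. The sketch below shows this together with the unit-norm normalization mentioned above. It is a plain illustration with hypothetical function names, operating on Python strings; it is not the implementation used in the experiments.

```python
import math
from collections import Counter


def ngram_counts(s, n):
    """Count all contiguous substrings of length n in s."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def spectrum_kernel(s, t, n):
    """Inner product of the n-gram count vectors of s and t."""
    cs, ct = ngram_counts(s, n), ngram_counts(t, n)
    return sum(count * ct[gram] for gram, count in cs.items())


def normalized_spectrum_kernel(s, t, n):
    """Spectrum kernel scaled so that k(s, s) = 1 (both strings need length >= n)."""
    return spectrum_kernel(s, t, n) / math.sqrt(
        spectrum_kernel(s, s, n) * spectrum_kernel(t, t, n))
```

Calling `normalized_spectrum_kernel` for `n = 1, ..., 10` on every pair of requests would give ten kernel matrices of the kind combined by MKL in this experiment.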

Table 2 shows the results for multiple kernel learning with various norms
and elastic net parameters $\mu$. The overall performance of all models is relatively
high, which is typical for intrusion detection applications, where very small
false positive rates are crucial. The elastic net instantiations perform relatively
similarly, with $\mu = 0.1$ being the most accurate one. It reaches about the same level
as $\ell_1$-block-norm MKL, which performs better than the non-sparse $\ell_{4/3}$-norm
MKL, the $\ell_4$-norm MKL, and the SVM with an unweighted-sum kernel. Out
of the block norm MKL versions, as already witnessed in the bioinformatics
experiment, $\ell_\infty$-norm MKL gives the best predictor.

6 Conclusion 

We presented a framework for multiple kernel learning that unifies several recent
lines of research in that area. We phrased the seemingly different MKL variants
as a single generalized optimization criterion and derived its dual. By plugging
in an arbitrary convex loss function many existing approaches can be recovered
as instantiations of our model. We compared the different MKL variants in
terms of their generalization performance by giving a concentration inequality
for generalized MKL that matches the previously known bounds for $\ell_1$ and $\ell_{4/3}$
MKL. We showed on artificial data how the optimal choice of an MKL model
depends on the properties of the true underlying scenario. We compared several
existing MKL instantiations on bioinformatics and network intrusion detection
data. Surprisingly, our empirical analysis shows that the recent uniformly
non-sparse $\ell_\infty$ MKL of [14] outperforms its sparse and non-sparse competitors in
both practical cases. It is up to future research to determine whether this
empirical success also translates to loss functions other than the hinge loss and
performance measures other than the area under the ROC curve.